Clustering Validity Assessment: Finding the Optimal Partitioning of a Data Set
نویسندگان
چکیده
Clustering is a mostly unsupervised procedure and the majority of the clustering algorithms depend on certain assumptions in order to define the subgroups present in a data set. As a consequence, in most applications the resulting clustering scheme requires some sort of evaluation as regards its validity. In this paper we present a clustering validity procedure, which evaluates the results of clustering algorithms on data sets. We define a validity index, S_Dbw, based on well-defined clustering criteria enabling the selection of the optimal input parameters’ values for a clustering algorithm that result in the best partitioning of a data set. We evaluate the reliability of our index both theoretically and experimentally, considering three representative clustering algorithms ran on synthetic and real data sets. Also, we carried out an evaluation study to compare S_Dbw performance with other known validity indices. Our approach performed favorably in all cases, even in those that other indices failed to indicate the correct partitions in a data set.
منابع مشابه
Clustering validity assessment using multi representatives
Clustering is a mostly unsupervised procedure and the majority of the clustering algorithms depend on certain assumptions in order to define the subgroups present in a data set. Moreover, they may behave in a different way depending on the features of the data set and their input parameters values. Therefore, in most applications the resulting clustering scheme requires some sort of evaluation ...
متن کاملAssessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories
In recent years, the tremendous and increasing growth of spatial trajectory data and the necessity of processing and extraction of useful information and meaningful patterns have led to the fact that many researchers have been attracted to the field of spatio-temporal trajectory clustering. The process and analysis of these trajectories have resulted in the extraction of useful information whic...
متن کاملChoosing the Best Hierarchical Clustering Technique Based on Principal Components Analysis for Suspended Sediment Load Estimation
1- INTRODUCTION The assessment of watershed sediment load is necessary for controling soil erosion and reducing the potential of sediment production. Different estimates of sediment amounts along with the lack of long-term measurements limits the accessibility to reliable data series of erosion rate and sediment yield. Therefore, the observed data of suspended sediment load could be used to ...
متن کاملEvaluating the validity of clustering results based on density criteria and multi-representatives
Although the goal of clustering is intuitively compelling and its notion arises in many fields, it has been difficult to define a unified approach to address the clustering problem and thus diverse clustering approaches abound in the research community. These approaches are based on different clustering principles and assumptions and they often lead to qualitatively different results. As a cons...
متن کاملQuality Scheme Assessment in the Clustering Process
Clustering is mostly an unsupervised procedure and most of the clustering algorithms depend on assumptions and initial guesses in order to define the subgroups presented in a data set. As a consequence, in most applications the final clusters require some sort of evaluation. The evaluation procedure has to tackle difficult problems, which can be qualitatively expressed as: i. quality of cluster...
متن کامل